The dataset contains quarterly tourist arrivals from the United States between 1981 and 2012, however, we missed data of the last quarter 2012, so I drop the data of 2012.
It includes one time series variable
(US arrivals), measured four times per year
(quarterly).
The data were originally obtained from the fpp2 package and are expressed in thousands of visitors.
library(readr)
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──
## ✔ ggplot2 4.0.0 ✔ fma 2.5
## ✔ forecast 8.24.0 ✔ expsmooth 2.3
##
library(TTR)
arrivals_data_csv <- read_csv("arrivals_quarterly.csv")
## Rows: 127 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Quarter
## dbl (4): Japan, NZ, UK, US
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(arrivals_data_csv)
## # A tibble: 6 × 5
## Quarter Japan NZ UK US
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1981 Q1 14.8 49.1 45.3 32.3
## 2 1981 Q2 9.32 87.5 19.9 23.7
## 3 1981 Q3 10.2 85.8 24.8 24.5
## 4 1981 Q4 19.5 61.9 52.3 33.4
## 5 1982 Q1 17.1 42.0 53.6 33.5
## 6 1982 Q2 10.6 63.1 34.8 28.4
spec(arrivals_data_csv)
## cols(
## Quarter = col_character(),
## Japan = col_double(),
## NZ = col_double(),
## UK = col_double(),
## US = col_double()
## )
arrivals_ts_st <- ts(arrivals_data_csv$US, start=c(1981,1), frequency=4)
# drop the data of 2012 since it only has 3 quarter
arrivals_ts <- window(arrivals_ts_st, end = c(2011, 4))
head(arrivals_ts)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1981 32.316 23.721 24.533 33.438
## 1982 33.527 28.366
tail(arrivals_ts)
## Qtr1 Qtr2 Qtr3 Qtr4
## 2010 111.615 127.002
## 2011 125.264 101.814 101.925 127.150
plot(arrivals_ts)
The time series of U.S. tourist arrivals (1981–2012) shows several key characteristics:
Upward trend: There is a clear long-term
increase in tourist arrivals, particularly from the early 1980s through
the late 1990s.
Strong seasonality: Regular quarterly peaks and troughs indicate recurring seasonal patterns, likely associated with holiday or vacation cycles.
Increasing variability: The fluctuations become larger over time, suggesting more volatile tourism demand in later years.
Recent flattening: After around 2000, the growth trend appears to slow down or level off, possibly due to economic factors or global events affecting travel.
Overall, the series demonstrates both trend and seasonality
acf(arrivals_ts)
The Autocorrelation Function (ACF) plot for the U.S. tourist arrivals time series shows the following characteristics:
Overall, the ACF plot supports the earlier findings that the data exhibits both trend and seasonality
# Display summary statistics
summary(arrivals_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.72 63.47 85.43 84.15 108.53 136.09
# Create a box plot for visualizing distribution
boxplot(arrivals_ts,
main = "Boxplot of U.S. Tourist Arrivals (1981–2012)",
ylab = "Number of Arrivals (in thousands)",
col = "lightblue",
border = "darkblue")
# box plot for visulizing distribution by year
years <- floor(time(arrivals_ts))
values <- as.numeric(arrivals_ts)
df <- data.frame(year = years, value = values)
boxplot(value ~ year, data = df,
main = "Boxplots by Year",
xlab = "Year", ylab = "Arrivals (thousands)",
col = "lightblue", border = "gray40", outline = TRUE, las = 2, cex.axis = 0.7)
The summary statistics for the U.S. tourist arrivals (1981–2012) are as follows:
| Statistic | Value |
|---|---|
| Minimum | 23.72 |
| 1st Quartile (Q1) | 63.95 |
| Median | 85.88 |
| Mean | 84.85 |
| 3rd Quartile (Q3) | 108.98 |
| Maximum | 136.09 |
Interpretation:
# Model output
naive_forecast <- naive(arrivals_ts,8)
plot(naive_forecast)
naive_residual = residuals(naive_forecast)
plot(naive_residual, main="Residuals from Naïve Forecast", ylab="Residuals", xlab = "Year")
abline(h=0, col="red")
The residuals are centered near zero (no major bias). These patterns indicate that the Naïve method fails to fully account for the seasonal and trend components in the data.
hist(naive_residual, main="Histogram of Naïve Forecast Residuals",
xlab="Residuals", col="lightblue", border="white")
The residuals are centered near zero, but the distribution isn’t symmetric.
That indicats the Naïve model doesn’t fit the data perfectly — some structure like seasonality is still left in the residuals.
plot(as.numeric(naive_forecast$fitted), as.numeric(naive_forecast$residuals),
main = "Residuals vs Fitted (Naïve Model)",
xlab = "fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
Although the residuals are centered around zero, the distribution is not completely symmetric, indicating that the Naïve model does not fully capture the trend or seasonality. There is no clear linear pattern, suggesting that there are no strong systematic errors and the residuals are largely random in value. However, the residuals show considerable fluctuation, implying that the model’s forecasting accuracy is relatively low.
plot(as.numeric(arrivals_ts),as.numeric(naive_forecast$residuals),
main = "Actual Values vs Residuals (Naïve Model)",
xlab = "Actual Values", ylab = "Residuals",
type = "p", pch = 19, col = "darkblue", cex= 0.8)
abline(h = 0, col = "red")
The “Actual vs Residuals” plot conveys a similar interpretation as the “Fitted vs Residuals” plot — the residuals are roughly centered around zero with no strong pattern, but the wide spread suggests the Naïve model does not fully capture the underlying structure of the time series.
Acf(naive_residual, main="ACF of Residuals - Naïve Forecast")
The ACF plot of residuals shows significant autocorrelation at seasonal lags (e.g., 4, 8, 12), indicating that the Naïve model has not fully captured the underlying seasonal pattern in the time series. Therefore, the residuals are not white noise.
accuracy(naive_forecast)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.7710081 12.47908 9.706016 -0.08056945 11.91185 1.301303
## ACF1
## Training set -0.1796656
naive_forecast$mean
## Qtr1 Qtr2 Qtr3 Qtr4
## 2012 127.15 127.15 127.15 127.15
## 2013 127.15 127.15 127.15 127.15
plot(naive_forecast$mean)
autoplot(naive_forecast) +
ggtitle("Naïve Forecast for Next 8 Quarters") +
ylab("Tourist Arrivals (Thousands)") +
xlab("Year") +
theme_minimal()
The Naïve model’s accuracy is moderate. The RMSE of 12.48 and MAPE around 11.9% suggest that the model captures general stability in the series but cannot adjust for trend or seasonality.
The Naïve forecast provides a simple benchmark model with easy interpretation but limited predictive capability. It serves as a baseline for comparison against more sophisticated models such as Holt’s Trend or Holt-Winters.
# Compute moving averages of different orders
MA3_forecast <- ma(arrivals_ts, order = 3)
MA6_forecast <- ma(arrivals_ts, order = 6)
MA9_forecast <- ma(arrivals_ts, order = 9)
# Plot original time series
plot(arrivals_ts, main = "Simple Moving Averages (Orders 3, 6, 9)",
ylab = "Tourist Arrivals (Thousands)", xlab = "Year", col = "black")
# Add the three moving averages
lines(MA3_forecast, col = "red", lwd = 2)
lines(MA6_forecast, col = "blue", lwd = 2)
lines(MA9_forecast, col = "green", lwd = 2)
# Forecast next 4 quarters using order 6 (bonus)
fit_ma6 <- auto.arima(na.omit(MA6_forecast))
forecast_ma6 <- forecast(fit_ma6, h = 4)
lines(forecast_ma6$mean, col = "purple", lwd = 2)
legend("topleft",
legend = c("Original", "MA(3)", "MA(6)", "MA(9)", "Forecast MA(6)"),
col = c("black", "red", "blue", "green", "purple"),
lwd = 2, bty = "n")
Overall, increasing the order of the moving average reduces short-term noise but delays response to sudden changes in the data.
ss_forecast = ses(arrivals_ts, h = 4)
plot(ss_forecast)
ss_forecast$model
## Simple exponential smoothing
##
## Call:
## ses(y = arrivals_ts, h = 4)
##
## Smoothing parameters:
## alpha = 0.3625
##
## Initial states:
## l = 29.2613
##
## sigma: 11.0571
##
## AIC AICc BIC
## 1197.661 1197.861 1206.122
Overall, the SES model with α = 0.36 provides a moderately smoothed forecast, effectively capturing the level of the series without modeling trend or seasonality. The residual variance (σ ≈ 11) shows reasonable stability, making SES suitable for short-term level forecasting.
ss_residual = residuals(ss_forecast)
plot(ss_residual, main="Residuals from Simple Smoothing Forecast", ylab="Residuals", xlab = "Year")
abline(h=0, col="red")
The residuals fluctuate around zero, indicating no systematic bias in the model. However, the repeating peaks and troughs suggest that some seasonal pattern remains in the residuals — meaning the Simple Exponential Smoothing model has not captured the seasonality in the data.
hist(ss_residual, main = "Histogram of SS model Residual", xlab="Residuals", col="lightblue", border="white")
plot(as.numeric(ss_forecast$fitted), as.numeric(ss_forecast$residuals),
main="Fitted Values vs Residuals (SS Model)",
xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")
The residuals are roughly centered around zero and randomly scattered, suggesting that the SS model adequately captures the general level of the data but may not fully explain trend or seasonal variation.
plot(as.numeric(arrivals_ts), as.numeric(ss_forecast$residuals),
main = "Actual Values vs Residuals (SS Model)",
xlab = "Actual Values", ylab = "Residuals",
type = "p", pch = 19, col = "darkblue", cex= 0.8)
abline(h = 0, col = "red")
The residuals show no clear pattern against actual values, though slightly larger deviations occur at higher levels, suggesting the SS model struggles to capture high-season fluctuations.
Acf(ss_residual, main="ACF of Residuals - SS Forecast")
The ACF plot of residuals still shows significant spikes at seasonal lags (for example, around lag 4, 8, 12, etc.), indicating that seasonal correlation remains in the residuals. This means that the Simple Smoothing (SS) model — which only captures the level of the time series — did not adequately model the seasonal structure present in the data.
accuracy(ss_forecast)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.933619 10.96758 8.848115 1.518322 10.87885 1.186282 0.06230072
ss_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2012 Q1 116.1663 101.99601 130.3365 94.49472 137.8378
## 2012 Q2 116.1663 101.09393 131.2386 93.11511 139.2174
## 2012 Q3 116.1663 100.24287 132.0897 91.81353 140.5190
## 2012 Q4 116.1663 99.43505 132.8975 90.57808 141.7545
plot(ss_forecast)
The forecasted average value for the next year is approximately 116 (thousands), but each quarter shows different lower and upper prediction intervals (Lo/Hi 80–95%), indicating seasonal fluctuation and uncertainty around the mean forecast.
hw_forecast = hw(arrivals_ts)
forecast_hw = forecast(hw_forecast, h = 4)
forecast_hw$model
## Holt-Winters' additive method
##
## Call:
## hw(y = arrivals_ts)
##
## Smoothing parameters:
## alpha = 0.4975
## beta = 1e-04
## gamma = 0.0921
##
## Initial states:
## l = 27.0003
## b = 0.7211
## s = 6.6781 -4.2792 -6.5704 4.1715
##
## sigma: 7.8361
##
## AIC AICc BIC
## 1118.013 1119.592 1143.395
hw_residual = residuals(forecast_hw)
#plot residual
plot(hw_residual, main = "Residuals from Holt Winter Model", ylab="Residuals", xlab = "Year")
abline(h=0, col="red")
#histogram
hist(hw_residual, main = "Histogram of Holt Winter model Residual", xlab="Residuals", col="lightblue", border="white")
# fitted value vs residuals
plot(as.numeric(fitted(forecast_hw)), as.numeric(hw_residual),
main="Fitted Values vs Residuals (Holt Winter Model)",
xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")
# real values vs residuals
plot(as.numeric(arrivals_ts), as.numeric(hw_residual),
main = "Actual Values vs Residuals (Holt Winter Model)",
xlab = "Actual Values", ylab = "Residuals",
type = "p", pch = 19, col = "darkblue", cex= 0.8)
abline(h = 0, col = "red")
#ACF plot of residuals
Acf(hw_residual, main="ACF of Residuals - Holt Winter Forecast")
The small residual spikes near zero are random and expected, indicating that the Holt-Winters model effectively captures the underlying level, trend, and seasonality. Only a few larger peaks appear, but they do not follow a systematic seasonal pattern and can be considered random noise rather than model bias.
The residual diagnostics confirm that the Holt–Winters additive model
provides a good fit for the data.
Residuals resemble white noise — centered around zero, normally
distributed, and uncorrelated —
indicating that the model effectively explains both trend and
seasonality in the tourist arrivals series.
accuracy(forecast_hw)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.003138538 7.5791 5.505299 -0.3492589 7.191839 0.7381052
## ACF1
## Training set 0.0132075
forecast_hw
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2012 Q1 125.2931 115.25079 135.3355 109.93469 140.6516
## 2012 Q2 108.0475 96.83056 119.2644 90.89269 125.2022
## 2012 Q3 113.8382 101.55815 126.1182 95.05750 132.6189
## 2012 Q4 126.0028 112.74427 139.2614 105.72561 146.2800
plot(forecast_hw)
The Holt–Winters additive model achieved strong accuracy metrics: -
ME = –0.083, RMSE = 7.58, MAE
= 5.51, MAPE = 7.18%, MASE =
0.74, ACF1 = 0.01.
These results indicate that forecast errors are small and
unbiased.
The low RMSE and MAPE show the model performs
substantially better than simpler methods such as Naïve or SES.
The near-zero ACF1 value confirms that residuals are
nearly uncorrelated, which means the model has captured most of the
systematic patterns.
The model predicts tourist arrivals (in thousands) for 2012 as follows: | Quarter | Forecasted Value | |———-|——————| | Q1 | 125.29 | | Q2 | 108.05 | | Q3 | 113.84 | | Q4 | 126.00 |
Overall, the model expects a temporary dip in mid-year (Q2–Q3)
followed by a recovery in Q4 —
a pattern consistent with the seasonal nature of U.S. tourist
arrivals.
Conclusion:
The Holt–Winters additive method delivers accurate, stable forecasts and
effectively models both trend and seasonal variations,
making it one of the most suitable approaches for this quarterly tourism
dataset.
#decomposition
stl_decomp <- stl(arrivals_ts, s.window = "periodic")
plot(stl_decomp)
attributes(stl_decomp)
## $names
## [1] "time.series" "weights" "call" "win" "deg"
## [6] "jump" "inner" "outer"
##
## $class
## [1] "stl"
The “seasonal” panel shows a strong, repeating quarterly pattern that remains stable across time. This confirms the presence of consistent seasonal variation every year.
The seasonal amplitude remains roughly constant over time — it does not increase with the overall level of the series. Hence, an additive model is appropriate.
Q1: below average tourist arrivals, Q2 sightly below average, Q3 peack tourist(summer), Q4 Above average(holiday season)
The STL decomposition clearly indicates a strong additive seasonal pattern with consistent quarterly peaks and troughs. Tourist arrivals peak in Q3 due to summer travel demand and drop in Q1 due to winter seasonality. The trend component shows a long-term upward growth, reflecting the overall increase in tourism over the decades.
# seasonal adjustment
seasadj(stl_decomp)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1981 25.84092 31.65618 29.13991 27.37099
## 1982 27.05192 36.30118 35.46291 27.22599
## 1983 25.99692 40.30418 42.08291 32.04499
## 1984 36.07792 49.21218 37.66291 38.40499
## 1985 41.31692 51.00518 45.72291 59.36099
## 1986 52.90192 61.21818 53.11691 79.81099
## 1987 77.65492 72.28218 70.58291 90.09799
## 1988 72.13392 77.24118 90.87391 86.67099
## 1989 66.84592 67.75418 65.47491 61.69399
## 1990 61.16192 64.08818 63.38991 62.95099
## 1991 56.75992 64.62618 90.07991 60.26499
## 1992 63.38592 69.43618 62.47591 67.56499
## 1993 69.84792 70.65318 69.33791 71.42799
## 1994 74.95292 71.48918 71.11191 72.12299
## 1995 76.57192 75.38318 73.84091 79.09299
## 1996 78.51992 78.94018 76.01891 83.41599
## 1997 80.92992 80.99618 84.59291 83.07399
## 1998 94.69192 93.31918 87.19291 98.69299
## 1999 107.38492 100.42918 99.60791 109.63699
## 2000 112.36492 115.24818 136.43991 124.03299
## 2001 120.22692 121.40618 109.84591 94.99099
## 2002 114.65192 104.95118 106.86791 108.03599
## 2003 106.43292 102.14718 107.56791 105.97199
## 2004 107.54192 109.00018 110.49091 106.26799
## 2005 116.24992 112.47718 112.33191 105.21899
## 2006 117.13292 117.80018 112.39391 108.75699
## 2007 118.20192 116.02418 112.34491 113.12199
## 2008 121.59092 107.40618 115.84991 109.56899
## 2009 112.68292 113.44218 123.58591 130.02699
## 2010 121.05992 113.90918 116.22191 120.93499
## 2011 118.78892 109.74918 106.53191 121.08299
# Plot a line on the graph
plot(arrivals_ts)
lines(seasadj(stl_decomp), col="Red")
The seasonally adjusted series (red line) smooths out the periodic ups and downs seen in the original data, confirming that the time series exhibits strong and consistent seasonal fluctuations that significantly affect the short-term values of tourist arrivals.
# Create accuracy comparison table
model_comparison <- rbind(
naive = accuracy(naive_forecast),
ss = accuracy(ss_forecast),
ma = accuracy(forecast_ma6),
hw = accuracy(forecast_hw)
)
rownames(model_comparison) <- c( "Naive", "SES", "Moving Average","Holt Winter")
round(model_comparison[, c("RMSE","MAE","MAPE","MASE")], 3)
## RMSE MAE MAPE MASE
## Naive 12.479 9.706 11.912 1.301
## SES 10.968 8.848 10.879 1.186
## Moving Average 0.966 0.736 0.922 0.136
## Holt Winter 7.579 5.505 7.192 0.738
In this project, MASE was selected as the primary accuracy measure, as it provides a standardized comparison across models. Based on MASE, the Moving Average model achieved the best performance (MASE = 0.136), indicating that it significantly outperformed the Naïve benchmark and provided the most accurate forecasts.
| Accuracy Measure | Best Model | Worst Model | Interpretation |
|---|---|---|---|
| MASE | Moving Average | Naïve | MA performs best on a scaled basis, far outperforming all others. |
Moving Average: Lowest overall errors (RMSE = 0.966, MAPE = 0.922%), indicating best short-term predictive accuracy.
Holt-Winters:Effectively models both trend and seasonality; slightly higher error due to model complexity.
Simple Exponential Smoothing (SES):Captures short-term level changes but lacks seasonality adjustment.
Naïve Model:Baseline for comparison; highest error across all metrics.